Abstract

We present an application of a particular machine-learning method (Boosted
Decision Trees, BDTs, using AdaBoost) to separate stars and galaxies in
photometric images using their catalog characteristics. BDTs are a
well-established machine learning technique used for classification purposes.
They have been widely used, especially in the fields of particle and
astroparticle physics, and we use them here in an optical astronomy
application. This algorithm is able to improve on simple thresholding cuts on
standard separation variables that may be affected by local effects such as
blending or badly calculated background levels, or that do not include
information in other bands. The improvements are shown using the Sloan Digital
Sky Survey Data Release 9, with respect to the type photometric classifier. We
obtain an improvement in the impurity of the galaxy sample of a factor 2-4 for
this particular dataset, adjusting for the same efficiency of the selection.
Another main goal of this study is to verify the effects that different input
vectors and training sets have on the classification performance, the results
being of wider use to other machine learning techniques.

1. Introduction


Object classification in photometric images is an important first step in
any analysis based on catalogs from such sources, as it constitutes a
fundamental tool to build the set to be used for model comparison or parameter
estimation. In particular, for cosmological analyses, a significant fraction
of stars contaminating the galaxy sample can change the amplitude of the
galaxy power spectrum. If this misclassified population (represented by the
impurity fraction I) is spatially unclustered, the amplitude of the power
spectrum is changed by a factor $(1 - I)^2$ and errors must be increased to
account for it, or a correction has to be applied. A well determined
clustering amplitude is key for measuring effects such as the galaxy bias from
a specific galaxy population (Coupon et al. 2012), understanding large-scale
cosmological effects versus a systematic stellar contamination component (see
for example Thomas et al. (2011) and Ross et al. (2011)) or distinguishing
cosmological models with primordial non-Gaussianities (Giannantonio et al.
2014).
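As a quick numerical illustration of the $(1 - I)^2$ factor quoted above, the following minimal sketch evaluates it for a few impurity levels. The function name and the example impurity values are chosen here for illustration only and do not come from the analysis.

```python
def diluted_amplitude_factor(impurity):
    """Factor multiplying the power-spectrum amplitude when a fraction
    `impurity` of the sample is an unclustered contaminant: (1 - I)^2."""
    if not 0.0 <= impurity < 1.0:
        raise ValueError("impurity must be in [0, 1)")
    return (1.0 - impurity) ** 2


# Illustrative impurity fractions (not measured values).
for impurity in (0.01, 0.05, 0.10):
    factor = diluted_amplitude_factor(impurity)
    print(f"I = {impurity:.2f} -> amplitude factor {factor:.4f}")
```

Under this relation, even a 5% unclustered stellar contamination lowers the measured amplitude by roughly 10%, which is why the impurity is the metric followed throughout the paper.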


Star-galaxy classification has been addressed using many different
morphology-based cuts since the first photographic plate surveys
(MacGillivray et al. (1976), Sebok (1979), Heydon-Dumbleton et al. (1989),
Maddox et al. (1990)) and with more sophisticated techniques with the advent
of digital imaging, machine learning methods (Odewahn et al. (1992), Weir et
al. (1995), Miller and Coe (1996), Bertin and Arnouts (1996)) and
exponentially increasing computational power. Most of the implementations
have also addressed the problem from the morphological point of view.
Multi-band imaging surveys, such as the Sloan Digital Sky Survey (SDSS) or the
Canada-France-Hawaii Telescope Legacy Survey (CFHTLS), have opened up the
possibility of adding color information as input variables (henceforth termed
features) for the classifier. This is explored in Ball et al. (2006) for SDSS
Data Release 3 (DR3), in Hildebrandt et al. (2012) for CFHTLenS, and in Fadely
et al. (2012) to select a pure star sample for Milky Way studies using SDSS
DR7. Recently, in Malek et al. (2013), the authors performed a classification
study using Support Vector Machines with VIPERS data as training set,
highlighting the importance of adding infrared data to enhance the
classification.

In this paper, we investigate the usage of AdaBoost Boosted Decision
Trees as star-galaxy classifiers, and test their performance in galaxy selection
against the standard SDSS morphological selection in SDSS Data Release 9.
We use this popular flavor of decision trees to address this issue for the first
time on optical catalog information, where we have broadened the scope of
input features to use color and morphological information simultaneously.
Beyond optimizing the tree parameters, the goal is to study the influence of
color and morphological information separately, and the influence of different
sizes and depths of training sets, which are required by any empirical
classifier.

Decision Trees (DTs) have been explored thoroughly in the past for this
purpose, as described in Suchkov et al. (2005), who were the first to apply a
DT to separate objects from the SDSS-DR2. Later, in Ball et al. (2006), an
axis-parallel decision tree was applied, using almost 500k objects from
SDSS-DR3 with an extensive exploration of parameters using as input features
the colors of the objects, for the range up to r = 20. In Vasconcellos et al.
(2011) the authors broadened the scope of this work by comparing 13 different
Decision Tree algorithms up to r = 21 and using SDSS DR7 as testbed, but
limiting themselves to morphological parameters.

Boosted Decision Trees, introduced in Freund and Schapire (1997), have
been used very successfully in high energy physics (Roe et al. 2005),
including particle classification in MiniBooNE (Yang et al. 2005), CMS data
for identification of the Higgs particle (CMS-Collaboration, 2012), AMS
(Aguilar et al. 2013) and Fermi (Fermi-LAT-Collaboration, 2012). In optical
astronomy, an application has been developed to extract photometric redshifts
from imaging surveys (Gerdes et al. 2010), outperforming implementations based
on neural networks. They have also been used for artifact identification in
supernovae searches (Bailey et al., 2007).


The paper is structured as follows: in Section 2, BDTs and the specific
implementation we have used are detailed. In Section 3 we describe the
dataset employed, data features chosen, training, evaluation and test sets. In
Section 4 we detail the approach for the optimization of the tree parameters
for our specific problem, i.e., obtaining high purity galaxy samples. We
show our results for the best parameter set in Section 5 and we compare the
performances for different training sets and feature selection. Then we end
with some conclusions and possible lines of future work.


2. Boosted Decision Trees 

A Decision Tree is a structured classifier which makes step-by-step choices,
each based on a single feature describing the data. A series of sequential
cuts is devised to separate the data into one of two categories: signal and
background. The value of the cuts, the feature used and the order in which
they are applied are established using a training set. The process continues
through these nodes until a final node (leaf) is reached.

The training process starts at a root node with an arbitrary choice of
feature and value of the cut. The separation into signal and background is
done according to this criterion and a separation power g is evaluated. In this
case, we use the Gini index to determine the performance of this particular
choice:

$$ G = p \, (1 - p) \qquad (1) $$

where p is the purity of the selected sample (whether it be signal or
background). Using the index P for the parent node and the indices s and b
for the signal and background daughter nodes, we determine the best choice
of feature and value of the cut which maximizes:

$$ g = \left| G_P - (G_s + G_b) \right| \qquad (2) $$

Every input feature is scanned, using a predetermined number of cuts
for each (parameter ncuts), to look for the best pair at each node. The
tree keeps growing in this way until a minimum number of data points in a
particular node is reached (parameter nevmin) or the number of consecutive
nodes reaches a predetermined maximum (parameter maxdepth).
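The scan over cut values at a single node can be sketched as follows. This is a minimal illustration and not the TMVA implementation. It assumes equally spaced candidate cuts, and it weights each daughter's Gini index by its share of events. That weighting is an assumption added here so that an uninformative split scores zero; Eq. 2 as printed shows no weights. The function names are hypothetical.

```python
def gini(n_signal, n_total):
    """Gini index G = p * (1 - p), with p the signal purity of the node."""
    if n_total == 0:
        return 0.0
    p = n_signal / n_total
    return p * (1.0 - p)


def best_cut(values, labels, ncuts=20):
    """Scan `ncuts` equally spaced thresholds on one feature.

    values -- feature value of each training object
    labels -- 1 for signal, 0 for background
    Returns (separation, threshold) for the cut with the largest separation.
    """
    n = len(values)
    lo, hi = min(values), max(values)
    g_parent = gini(sum(labels), n)
    best_sep, best_thr = -1.0, None
    for k in range(1, ncuts + 1):
        thr = lo + (hi - lo) * k / (ncuts + 1)
        above = [lab for v, lab in zip(values, labels) if v > thr]
        below = [lab for v, lab in zip(values, labels) if v <= thr]
        # Assumption: daughters are weighted by their fraction of events.
        g_daughters = (len(above) * gini(sum(above), len(above))
                       + len(below) * gini(sum(below), len(below))) / n
        sep = abs(g_parent - g_daughters)
        if sep > best_sep:
            best_sep, best_thr = sep, thr
    return best_sep, best_thr
```

In a full tree this scan would be repeated for every input feature at every node, keeping the (feature, cut) pair with the largest separation, and stopping according to nevmin and maxdepth.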

Decision Trees are known to be a powerful but unstable learning method, 
i.e., a small change in the training sample can translate into a large change 
in the tree and the result of the classification. In addition, a theoretically 
’perfect’ classification can be achieved if the tree is allowed to develop fully 
so that each leaf only contains signal or background data points, therefore 
separating fully the dataset. Of course, this is only an accurate description of 
the training set, which most probably will not be descriptive of new data, as 
it has incorporated all the noise inherent to that specific data (overfitting). 

Boosting is a way of enhancing the classification performance and
increasing the stability with respect to statistical fluctuations in the
training sample, as well as avoiding the overfitting problem. If a training
data point is misclassified in a leaf, a weight is assigned to that data
point, according to:

$$ w = \frac{1 - e}{e} \qquad (3) $$

where e is the misclassification rate of the tree. The weight w is assigned
to all such data points and a second tree is generated anew, with the
original dataset using these weights instead (well classified values keep a
weight value w = 1). The process is iterated tens or hundreds of times
(parameter ntrees), with all the resulting trees combined into a 'forest' to
provide significantly enhanced classification power. This is the so-called
AdaBoost technique (Freund and Schapire, 1997). With this forest of trees at
hand, the classification of a single data point is performed based on the
majority vote of the classifications done by each tree.
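A single boosting iteration, as described above, can be sketched as below. This is a simplified illustration rather than the TMVA code. It takes the tree's predictions as given, uses the weighted misclassification rate for e, and applies an unweighted majority vote as stated in the text. The function names and the +1/-1 label convention are assumptions made for the example.

```python
def boost_weights(weights, truth, predicted):
    """One AdaBoost step: multiply the weight of every misclassified point
    by (1 - e) / e, where e is the weighted misclassification rate.
    Correctly classified points keep their current weight."""
    total = sum(weights)
    wrong = sum(w for w, t, p in zip(weights, truth, predicted) if t != p)
    err = wrong / total
    if not 0.0 < err < 0.5:
        raise ValueError("expected a weak learner with 0 < e < 0.5")
    alpha = (1.0 - err) / err
    new_weights = [w * alpha if t != p else w
                   for w, t, p in zip(weights, truth, predicted)]
    return new_weights, err


def majority_vote(tree_outputs):
    """Combine per-tree labels (+1 signal, -1 background) by majority."""
    return 1 if sum(tree_outputs) > 0 else -1
```

With ten equally weighted points of which two are misclassified, e = 0.2 and the two misclassified points receive a weight of 4, so that they carry as much total weight as the eight well classified ones when the next tree is trained.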


We have used the Toolkit for Multivariate Analysis (TMVA) framework (Hoecker
et al. 2007), provided with the ROOT analysis package (Brun and Rademakers,
1997), widely used in high energy physics with great success. This
framework has been used in other astrophysical applications such as the
ArborZ photometric redshift code described in Gerdes et al. (2010). It is
specifically designed for the parallel evaluation and application of
different multivariate classification techniques, among which are AdaBoost
Boosted Decision Trees.

A first test was performed on a training sample based on SDSS DR7
data (Etayo-Sotos and Sevilla-Noarbe 2013) using several of the methods
described in the package, with some standard, default values. The results
are shown in Figure 1 via the Receiver Operating Characteristic (ROC) curve,
which measures the true positive rate versus the false positive rate of the
classifier for different thresholds. The BDTD method (which is a Boosted
Decision Tree with a prior step of input feature decorrelation) turns out to
have the best performance for this problem and training set. The
decorrelation step takes care of linear correlations between the input
features (vector x) by computing the square root S of their covariance matrix
and constructing a new input feature vector $x' = S^{-1} x$. The other
standard methods which were compared are:


• k-Nearest Neighbors (kNN): a method which searches for the k closest
training events in feature space.

• Fisher Discriminant (BoostedFisher): a linear discriminant analysis in
which an axis in feature hyperspace is determined so that signal and
background are as separated as possible.

• Neural Network (MLP): a standard multi-layer perceptron implementation
of this classic technique, in which a non-linear mapping of the input
feature vector is done onto a one-dimensional space as well. This is done
through a complex mesh of cells which react to the input variables and
modify their final classification accordingly.


This result, coupled with the success of this specific implementation in
recent particle physics literature, pushed us to choose this machine learning
algorithm for our tests.
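The decorrelation step used by the BDTD method, described before the list of compared methods, can be illustrated for the simplest case of two input features. The sketch below is not the TMVA code. It relies on the closed form of the square root of a 2x2 positive-definite matrix C, namely S = (C + s I) / t with s = sqrt(det C) and t = sqrt(tr C + 2 s), which is valid only in two dimensions. All function names are hypothetical.

```python
import math


def covariance_2d(points):
    """Sample covariance (var_x, cov_xy, var_y) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    d = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return a, b, d


def inverse_sqrt_2d(a, b, d):
    """Elements (i11, i12, i22) of S^{-1}, where S is the symmetric square
    root of the positive-definite matrix [[a, b], [b, d]]."""
    s = math.sqrt(a * d - b * b)
    t = math.sqrt(a + d + 2.0 * s)
    s11, s12, s22 = (a + s) / t, b / t, (d + s) / t
    det = s11 * s22 - s12 * s12
    return s22 / det, -s12 / det, s11 / det


def decorrelate(points):
    """Apply x' = S^{-1} x to every (x, y) feature pair."""
    i11, i12, i22 = inverse_sqrt_2d(*covariance_2d(points))
    return [(i11 * x + i12 * y, i12 * x + i22 * y) for x, y in points]
```

After the transformation the sample covariance of the new features is the identity matrix, i.e. linear correlations between the inputs have been removed. For more than two features a general matrix square root (for instance via an eigendecomposition) would be needed.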

Random Forests are also a particularly successful technique in the field of
classification and regression in astronomy (see, e.g., Carrasco Kind and
Brunner (2013)). They have better generalization properties as they can
account for some scatter from the training set to the application set. On the
other hand, AdaBoost BDTs can slightly outperform them if the training set is
representative enough. In recent tests with photometric data both give similar
performances for classification (I. Sevilla-Noarbe and the DES Collaboration,
in prep., and AlSayyad et al. (2015), as well as Y. Al-Sayyad, private
communication). For supernova candidate identification, random forests and
boosted decision trees also compete for best performance with variable
results (see Bailey et al. (2007) or Goldstein et al., in prep.).


3. Dataset 


We have used this implementation of BDTs on Data Release 9 (DR9)
of the Sloan Digital Sky Survey (SDSS) (Ahn et al. 2012). The data was
downloaded using the DR9 Catalog Archive Server, making a selection on
modelmag_r from 14 to 23 and using only spectroscopically matched objects
from the photometric table, to provide a truth value for the purposes of
evaluating the algorithm.

Several shape features in the r band and several magnitude measurements
in all bands (u, g, r, i, z) have been used.

We have limited the shape information to only one band as, to first
approximation, the values for these parameters across bands should be quite
compatible. With respect to flux information, we have used a range of
different magnitude types (fiber, model, petro, psf) for bands u, g, r, i, z.

Finally, we include the photometric SDSS classification (type) for the
object, as well as the spectroscopic classification (class) which we use as the
reference (truth) value for performance in terms of purity and completeness
for this work.

Figure 1: Efficiency vs purity plot (ROC curve) for different machine
learning methods in TMVA applied to the SDSS DR7 training sample described in
Etayo-Sotos and Sevilla-Noarbe (2013). BDTD (decorrelated Boosted Decision
Trees) shows the best behavior. [Axis label: Signal efficiency]


In Table 1 we summarize all the photometric catalog features used. The
specific selection is provided in the Appendix and the resulting catalog
provides a total of 2195172 objects.

One of the reasons for choosing this range of magnitudes and features
is also to allow for an easier comparison with the performances quoted by
Vasconcellos et al. (2011), which we have used as reference. In this case, the
authors performed a thorough testing of different Decision Tree flavors. The
original AdaBoost Boosted Decision Tree implementation we chose is not
contemplated in their study, and we will quantitatively compare the results
obtained, though our goal is to understand the impact of different choices of
features and training set characteristics.

Table 1: List of input features of the SDSS catalogs used for the training.
Shape parameters taken only from the r band. Color parameters include all
types of magnitudes: fiber, Petrosian, model and PSF.

Parameter            Type        Description
-------------------  ----------  ----------------------------------------------
petroR50_r           Shape       Radius containing 50% of Petrosian flux
petroR90_r           Shape       Radius containing 90% of Petrosian flux
lnlstar_r            Shape       Logarithm of likelihood of fit to PSF shape
lnlexp_r             Shape       Logarithm of likelihood of fit to an
                                 exponential profile
lnldev_r             Shape       Logarithm of likelihood of fit to a
                                 de Vaucouleurs profile
me1_r                Shape       Ellipticity component 1
me2_r                Shape       Ellipticity component 2
mrrcc_r              Shape       Sum of second moments of object
fibermag_[ugriz]     Magnitude   Magnitude as measured using the optical fiber
                                 aperture
petromag_[ugriz]     Magnitude   Petrosian magnitude in each band
modelmag_[ugriz]     Magnitude   Magnitude as best measured by either
                                 exponential or de Vaucouleurs profile
psfmag_[ugriz]       Magnitude   Magnitude as measured using the local PSF
mag_[u]-mag_[g]      Color       u - g color
mag_[g]-mag_[r]      Color       g - r color
mag_[r]-mag_[i]      Color       r - i color
mag_[i]-mag_[z]      Color       i - z color

We randomly sampled the resulting catalog into training, evaluation and
test samples.

• 200000 objects went into the training sample to have a variety of smaller
training sets and measure training sample size dependency. From this
sample, the TMVA framework (see Section 4) uses a specified amount
for actual training. This is useful for the comparison of different codes
in the same execution.

• 800000 objects went into the evaluation sample, which we use to
optimize the classifier parameters.

• The rest (1195172) form the testing sample, which is the one actually
used to evaluate real performance.
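The random partition into the three samples can be sketched as follows. This is only an illustration of the bookkeeping. The function name, the use of row indices and the fixed random seed are assumptions made here, not details of the original procedure.

```python
import random


def split_catalog(n_objects, n_train=200000, n_eval=800000, seed=0):
    """Randomly partition catalog row indices into training, evaluation
    and test samples; the test sample takes whatever is left."""
    if n_train + n_eval > n_objects:
        raise ValueError("training + evaluation sizes exceed the catalog")
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    evaluation = indices[n_train:n_train + n_eval]
    test = indices[n_train + n_eval:]
    return train, evaluation, test


# Catalog size quoted in the text: 2195172 objects.
train, evaluation, test = split_catalog(2195172)
print(len(train), len(evaluation), len(test))
```

Because the three index lists are disjoint slices of one shuffled list, no object can appear in more than one sample, which keeps the test set independent of both the training and the parameter optimization.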

In Figure 2 the magnitude distribution for the objects in the catalog is
shown, in this case for the training set (the evaluation and testing sets
have the same relative distributions).



Figure 2: Number count distribution of stars and galaxies in the training set 
for the downloaded SDSS DR9 catalog. 


4. Methodology 

To measure performance, we define the efficiency (E) and impurity (I) of
the galaxy sample as:

$$ E(m) = \frac{N^{\mathrm{selected}}_{\mathrm{galaxies}}}{N^{\mathrm{total}}_{\mathrm{galaxies}}} \times 100 \qquad (4) $$

$$ I(m) = \frac{N^{\mathrm{selected}}_{\mathrm{stars}}}{N^{\mathrm{selected}}_{\mathrm{stars+galaxies}}} \times 100 \qquad (5) $$

where ($N^{\mathrm{selected}}_{\mathrm{stars}}$,
$N^{\mathrm{selected}}_{\mathrm{galaxies}}$) is the number of true (stars,
galaxies) selected by the classifier at magnitude m and
$N^{\mathrm{total}}_{\mathrm{galaxies}}$ corresponds to the total sample
of true galaxies. True stars and galaxies are determined according to their
spectroscopic classification in the SDSS catalog via the class parameter.

These metrics show a method of direct comparison against science case 
requirements, usually expressed in these terms (or, equivalently, completeness 
and purity). In our case, we will be concerned with obtaining the lowest 
impurity from stars possible in our galaxy sample, given a fixed efficiency 
value. 
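The two metrics can be computed directly from the spectroscopic class and the classifier decision. The sketch below assumes the truth labels are given as the strings 'GALAXY' and 'STAR' and that the selection is a list of booleans for objects in one magnitude bin. The function name and input layout are illustrative choices.

```python
def efficiency_and_impurity(truth, selected):
    """Return (E, I) in percent for a galaxy selection in one magnitude bin.

    truth    -- list of 'GALAXY' or 'STAR' (spectroscopic class)
    selected -- list of booleans, True if the classifier kept the object
    """
    n_gal_total = sum(1 for t in truth if t == "GALAXY")
    n_gal_sel = sum(1 for t, s in zip(truth, selected)
                    if s and t == "GALAXY")
    n_star_sel = sum(1 for t, s in zip(truth, selected)
                     if s and t == "STAR")
    efficiency = 100.0 * n_gal_sel / n_gal_total
    impurity = 100.0 * n_star_sel / (n_star_sel + n_gal_sel)
    return efficiency, impurity


# Toy example: 8 true galaxies (7 kept) and 2 true stars (1 kept).
truth = ["GALAXY"] * 8 + ["STAR"] * 2
selected = [True] * 7 + [False] + [True, False]
print(efficiency_and_impurity(truth, selected))
```

In the toy example the efficiency is 7/8 and the impurity is 1/8, i.e. 87.5% and 12.5% respectively.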

To optimize and test the behavior of this classifier, we have followed these 
steps: 


1. Train and evaluate on the training and evaluation sets in a grid of BDT 
parameters. Select best set in terms of performance (impurity level for 
a given efficiency). 

2. Evaluate the performance on the evaluation set for different training 
set sizes and depths, as well as the computation times. 

3. Test the chosen configuration against the photometric type performance 
provided by the SDSS catalogs with the test set. 

4. Verify the impact of a different choice of features assuming the same 
parameters and training set size are valid. We will implicitly assume 
here the independence of the BDT parameters with respect to these 
choices. 


The BDT parameters to be tuned are described below. The values of the
grid are shown in Table 2, based on previous experience (Etayo-Sotos and
Sevilla-Noarbe, 2013):

• ntrees: Number of decision trees involved in the computation.

• nevmin: Minimum number of events held in a leaf.

• maxdepth: Maximum size of the tree, in terms of steps from the first
decision.

• ncuts: Number of bins used for the cuts in each feature being tested.

Table 2: List of Boosted Decision Trees parameters as named in the TMVA
environment and the values for which their performance is evaluated. In bold
face, the selected values after parallel coordinate analysis (see text).

Parameter   Range
----------  ----------------------------
ntrees      200, 400, 1000, 2000, 3000
nevmin      10, 50, 100, 400, 1000
maxdepth    5, 10, 15, 20, 30
ncuts       20, 50, 200, 500, 1000
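The parameter grid of Table 2 can be enumerated with a few lines of code, reading the table entries as five values per parameter. This sketch only builds the list of configurations, one per batch job. The dictionary layout and function name are assumptions for illustration, and the actual job submission is not shown.

```python
from itertools import product

# Values per parameter, as listed in Table 2.
GRID = {
    "ntrees": [200, 400, 1000, 2000, 3000],
    "nevmin": [10, 50, 100, 400, 1000],
    "maxdepth": [5, 10, 15, 20, 30],
    "ncuts": [20, 50, 200, 500, 1000],
}


def grid_points(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[name] for name in names))]


jobs = grid_points(GRID)
print(len(jobs))   # 5**4 = 625 configurations
print(jobs[0])
```

Four parameters with five values each give 5^4 = 625 combinations, each of which is trained on the training set and scored on the evaluation set.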

This grid was explored by submitting multiple batch jobs to a cluster
both for the training and evaluation sets, as defined in Section 3. We have
narrowed down to this particular set of values in each case after a test in a
wider range, the limits being imposed by performance and relative gain with
respect to execution time.

The computation was performed using the Euler cluster at the CIEMAT 
Spanish national lab, in Madrid. This cluster is composed of 144 nodes with 
2GB RAM and two Quad-core Xeon processors each at 3.0 GHz clock speed. 

5. Results and discussion 

We will now use the impurity metric defined in equation 5 as the value
to compare the performance of each Boosted Decision Tree set we produce,
as well as for the SDSS type parameter. In order to compare fairly with the
latter we have adjusted the selection cut for each BDT so that its efficiency
(equation 4) was within 0.1% of the efficiency found for the type classifier at
that particular magnitude bin (in modelmag_r).
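One way to match the efficiency of a reference classifier is to choose the cut on the BDT output from the ranked scores of the true galaxies in the bin. The sketch below assumes that a larger score means a more galaxy-like object and that objects with a score at or above the cut are kept. Both conventions, and the function name, are assumptions made for this example.

```python
def threshold_for_efficiency(galaxy_scores, target_efficiency):
    """Return (cut, achieved) where `cut` is the score threshold whose
    galaxy efficiency, in percent, is closest to `target_efficiency`.

    galaxy_scores -- classifier output for the true galaxies in one bin
    """
    ranked = sorted(galaxy_scores, reverse=True)
    n = len(ranked)
    # Number of galaxies that must pass the cut, at least one.
    n_keep = min(n, max(1, round(target_efficiency / 100.0 * n)))
    cut = ranked[n_keep - 1]
    achieved = 100.0 * sum(1 for s in galaxy_scores if s >= cut) / n
    return cut, achieved


# Toy scores, uniformly spread; target efficiency of 95%.
scores = [i / 1000 for i in range(1000)]
cut, achieved = threshold_for_efficiency(scores, 95.0)
print(cut, achieved)
```

The returned efficiency can then be checked against the 0.1% tolerance. With few galaxies in a bin, or with many tied scores, the tolerance may not be reachable and the closest available value is returned.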

In this section we detail the strategy to select the parameters for the BDT, 
as well as the impact of a varying training set size, composition, depth and 
feature selection, which can also provide hints for the expected performance 
of other machine learning methods which extract the same information and 
relationships in the data. 

5.1. Selection of optimal parameters for the BDTs 

We have executed the 625 jobs of the parameter grid, corresponding to
all combinations of parameters in Table 2, and obtained a 9-element vector
with the impurity level for each magnitude bin (modelmag in r-band). We
select the parameter set which provides the best (lowest) overall impurity
level in the evaluation set (boldface in Table 2). In order to do so, we
visualize the performance of all possible combinations through a parallel
coordinates plot such as the one shown in Figure 3. This type of
representation allows showing a set of points in an N-dimensional space, here
with four input parameters and a 9-component vector output, so that N parallel
lines (axes) are drawn, each encompassing the whole range of one input or
output value. A point in this N-dimensional space is represented as a polyline
with vertices on the parallel axes. In our case, the first four points
connected by the polyline represent the input parameter values of Table 2, and
the polyline continues to the 9 output components of the impurity vector,
which is the metric we are using. With this tool, it is possible to
distinguish the overall performance of a particular combination of input
values for the parameters against the background of other possible
combinations. In the figure, the specific combination which provides a good
overall impurity level throughout the magnitude range is shown with a thicker
line. This combination is highlighted in boldface in Table 2.

Examples of the effect of changing specific parameters are shown in
Figure 4. Increasing the number of trees, the tree depth and the number of
cuts in each feature decreases the impurity level until it reaches a stable
value, beyond which there is no significant gain, while we incur an execution
time penalty and an increased risk of overfitting (although the boosting
approach tries to avoid this).

The training set size used was subselected to include 30000 randomly
picked galaxies and 6000 stars likewise chosen. This size provides a suitable
trade-off between computation time and training performance. We will show
in Section 5.2 the results when these values are modified.

5.2. Dependence with training set size 

The number of samples, both for signal and background, is directly related
to the performance of the classifier, as an increasingly varied array of galaxy
and star types is covered. The fact that we have used an unbiased sample,
in the sense that no targeting has been specifically done for these objects,
covering a wide area, which diminishes the impact of sample variance,
makes this catalog an exceptional resource, up to the available depth. These
characteristics allow us to understand the impact of the training sample solely
in terms of its size, without worrying about the specific kinds of objects which
populate the sample.

Figure 3: Parallel coordinates visualization showing the relationships of the
four input parameters (first four axes: ntrees, nevmin, maxdepth, ncuts) with
the impurities obtained at each magnitude range (last nine axes, magnitude
bins 14 to 22). The thicker line corresponds to the choice of parameters and
the resulting impurity levels on the evaluation set, chosen for this paper.

In Figure 5 we show the evolution of the impurity level (our chosen
comparison metric) with respect to the size of the training set, as well as
the relative mixing of galaxies and stars. In Table 3 the training times for
different source and background sample sizes are shown.

By studying both results, a good compromise in terms of speed and
impurity metric is the choice of using 30000 galaxies and 6000 stars.
Increasing the number of galaxies does not improve things much. On the other
hand, not providing a sufficient number of stars to have a well balanced
sample can ruin our impurity performance, as the classifier will tend to
classify objects as galaxies. This can be seen in the lower right panel of
Figure 5, or in any panel when the available star sample is only populated by
500 objects: the small star-to-galaxy ratio tends to make objects be
classified predominantly as galaxies, as some specific stellar types may have
been randomly left out or are underrepresented.

Figure 4: Impurity level comparison with variations of a single parameter.


Note that selecting the most adequate training set size and mix, as well as
the best-performing parameter set (Section 5.1), has been an iterative
process: a default set of parameters was used with varying sample size, then
the most adequate parameters were chosen, then the training sample size
sensitivity was reanalyzed, and so on.

Figure 5: Impurity level for 3000 (upper left), 10000 (upper right), 30000
(lower left) and 100000 (lower right) galaxies in the training set, and variable
star sample.


Table 3: Training times (in seconds) for different choices of galaxies and
stars, source and background respectively, in the training sample.

                         Nb. galaxies in training
Nb. stars in training    1000    3000   10000   30000   100000
---------------------   -----   -----   -----   -----   ------
  500                      36      89     201     599     2550
 3000                     118     203     280     709     2980
 6000                     253     346     355     795     3030
12000                     547     322     489    1160     3230


5.3. Dependence with training set depth 

A common circumstance that many present and future photometric surveys
will face is the lack of adequate training for their machine learning based
classifiers, due to the unavailability of overlapping areas with spectroscopic
information reaching the full survey depth. In this section, we experiment
with variations in the availability of training samples depending on the
magnitude limit we impose on them, and verify the impact on the impurity of
the galaxy sample.

In Figure 6 the results for different choices of training depth are shown.



Figure 6: Effect of the usage of different depths for the training sample, 
defined in terms of a limit to modelmag_r. 

It can be verified that such variations are indeed significant, and more
important especially at magnitudes deeper than where training was available.
In fact, a morphological cut approach, such as the SDSS photometric type,
can be as valid as a machine learning method employing multiple features
if the training is not deep enough.

5.4. Dependence with features

It is interesting to explore what the different features from Table 1
contribute to the overall result. We separate for this study the available
features into shape-related, magnitude-related and color-related, as specified
in the second column of the aforementioned table.

In Figure 7 the impact of different choices of features is shown.

Figure 7: Effect of the usage of different kinds of input features on the
galaxy purity classification, as compared with the SDSS type parameter, based
on 'flux concentration' (modelmag - psfmag). [Axis label: Model mag r-band]


Color and magnitude information are clearly the most important sources
from which the BDTs cull their information. Using shape features (light
radii, ellipticity, fit likelihoods to models), on the other hand, cannot
compete with the clues provided by a concentration parameter, which proves to
be a robust measurement. Indeed, adding this combination to the shape
input features is an important improvement to the classifier, and provides a
similar response. Therefore, when using single band information, a simple
cut on a 'concentration'-like parameter (e.g. differences in fluxes from PSF
magnitudes and model magnitudes, or the SPREAD_MODEL parameter showcased
in Desai et al. (2012) and Soumagnac et al. (2015)) should be enough.


5.5. Evaluation against SDSS DR9 photometric type 

To test independently of the training and evaluation set, we have used
the test set, with the chosen parameters and training size. The results are
shown in Figure 8, to be compared with the results for the photometric
type provided with the catalog. This classifier has a similar performance
to the standard psfmag-cmodelmag > 0.145 cut [2]. Our suggested BDT approach
provides an improvement of 2 to 4 times on the impurity level, with
this relatively simple training approach, using color, magnitude and shape
information.

[2] https://www.sdss3.org/dr8/algorithms/classify.php



Figure 8: Comparison of the impurity in the galaxy sample in SDSS DR9 for 
the standard type classifier, and the method presented in this paper using 
Boosted Decision Trees. 


We can compare our results with the ones reported by Vasconcellos et al.
(2011) by fixing the efficiency values to approximately the same ones they
show in Figure 6 of their paper. We obtain impurity levels around or below
1% (except for ~2% at the magnitude bin of 14-15), perhaps slightly smaller
than what is shown for their DT classifier. However, their dataset is
shallower, and their choice of parameters is morphological, further evidencing
the conclusions of our work in terms of dependency of performance on depth
and feature selection.

The separation power of the BDT method can be qualitatively appreciated
in Figure 9.


6. Conclusions 

In this work we have shown the improvements that we obtain by applying
AdaBoost Boosted Decision Trees on SDSS DR9 photometric data to classify
objects as stars or galaxies, using colors as well as morphological features,
using a prior feature decorrelation step. This technique, very successful in
other fields akin to astrophysics, has never been applied, to our knowledge,
for this optical astronomy application. Using spectroscopic data from the
survey itself, we have tuned the parameters for best performance, and then
compared against the usage of the standard type photometric parameter from
the SDSS catalogs, obtaining up to a factor 4 improvement in the impurity
of the galaxy sample.

Figure 9: PSF - model magnitude (left) and BDT (right) separation variables
of spectroscopic stars and galaxies, as a function of the model magnitude
in the r band (for SDSS DR7 objects, from Etayo-Sotos and Sevilla-Noarbe
(2013)).

In addition to this we have made a few variations on the training sample 
to verify the impact on the classification. These results have general validity 
for other machine learning classifiers, which rely on the same available catalog 
information. 

The BDT parameter choice has been done scanning through different 
values, and using different training set sizes. Using a parallel coordinates 
plot for this kind of analysis has proven to be a very simple and useful tool 
which is widely extensible to other machine learning approaches. For this 
unbiased sample, which would cover a varied array of galactic and stellar 
types, we notice that beyond 30k objects for training, the improvement is 
not relevant. 

Training set mixture is a desirable feature of the training set, as too many
galaxies in the sample often fool the classifier into classifying stars
as galaxies. Therefore, a relevant presence of background objects (in our case,
at least 20%) is necessary to ensure an applicable training.

Colors and magnitudes are the most important features used by the BDTs
to improve the performance over the morphology-based SDSS type, though
the latter proves to be a simple and robust figure (based on concentration
of light) which is easy to implement and can be sufficient for many studies,
as has been proven in the literature. Using light concentration together with
shape information in this machine learning implementation simply converges
to the standard type classification. It is the addition of color-space
information that gives the additional edge.

Extensions to this work include new object types such as quasars (more
relevant in next generation deeper surveys) or image artifacts. Including
photometric redshift as an input feature is also an alternative avenue to
pursue as a 'color' selection. Finally, a real improvement of this classifier
would be incorporating it into a Bayesian framework. This way, the
computation of correlation functions, for example, would not need a
previously selected galaxy sample, but could incorporate all objects with an
associated probability. See for example Fadely et al. (2012), Carrasco Kind
and Brunner (2014), Kim et al., in prep. This would be an asset too for weak
lensing measurements, in which a contamination of the shear catalog by stars
introduces an additive bias in the shear-shear correlation function
(E. Sheldon, private communication).

The code used is made available with this publication. It requires previous
installation of the ROOT framework. We used version 5.18 for all our tests.
The dataset can be downloaded using the query in the Appendix and is also
available online.

Appendix: Query to obtain the catalog

SELECT
    p.ra, p.dec,
    p.petromag_u, p.petromag_g, p.petromag_r, p.petromag_i,
    p.petromag_z, p.modelmag_u, p.modelmag_g, p.modelmag_r,
    p.modelmag_i, p.modelmag_z, p.psfmag_u, p.psfmag_g,
    p.psfmag_r, p.psfmag_i, p.psfmag_z, p.fibermag_u,
    p.fibermag_g, p.fibermag_r, p.fibermag_i, p.fibermag_z,
    p.petrorad_r, p.petror50_r, p.petror90_r, p.lnlstar_r, p.lnlexp_r,
    p.lnldev_r, p.me1_r, p.me2_r, p.mrrcc_r, s.class
INTO mydb.SGtable
FROM dr9.PhotoPrimary AS p
JOIN dr9.SpecObj AS s
    ON s.bestobjid = p.objid
WHERE p.modelmag_r BETWEEN 14.0 AND 23.0